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GENERALIZED ADDITIVE MODEL SELECTION 

By Alexandra Chouldechova and Trevor Hastie* 
Carnegie Mellon and Stanford University 

We introduce GAMSEL (Generalized Additive Model Selection), 
a penalized likelihood approach for htting sparse generalized additive 
models in high dimension. Our method interpolates between null, 
linear and additive models by allowing the effect of each variable to 
be estimated as being either zero, linear, or a low-complexity curve, 
as determined by the data. We present a blockwise coordinate descent 
procedure for efficiently optimizing the penalized likelihood objective 
over a dense grid of the tuning parameter, producing a regularization 
path of additive models. We demonstrate the performance of our 
method on both real and simulated data examples, and compare it 
with existing techniques for additive model selection. 


1. Introduction. In many applications it may be too restrictive to suppose that the ef¬ 
fect of all of the predictors is captured by a simple linear fit of the form, r/(xj) = 

Generalized additive models, introduced in Hastie and Tibshirani (1986), allow for greater 
flexibility by modeling the linear predictor of a generalized linear model as a sum of more 
general functions of each variable: 


p 

r]{xi) = '^fjixij), 

where the fj are unknown functions, assumed to be smooth or otherwise low-complexity. 
While generalized additive models have become widely used since their introduction, their 
applicability has until recently been limited to problem settings where the number of pre¬ 
dictors, p, is modest relative to the number of observations, n. In this paper we propose a 
method for generalized additive model selection and estimation that scales well to problems 
with many more predictors than can be reasonably handled with standard methods. 

In large data settings it is often fair to assume that a large number of the measured vari¬ 
ables are irrelevant or redundant for the purpose of predicting the response. It is therefore 
desirable to produce estimators that are sparse, in the sense that fj = 0 for some, or even 
most, predictors. Furthermore, in many applications it may be reasonable to assume that 
the linear model, fj{x) = /djX, is adequate for many of the predictors. Linear relationships 
are easy to interpret, and the widespread use of linear models suggests that linear fits are 
often believed to be good approximations in practice. For purely linear models, the lasso 
(Tibshirani, 1996) is an effective form of regularization that performs model selection. The 
package glmnet (Friedman, Hastie and Tibshirani, 2010) implements the lasso regulariza¬ 
tion path for an important subset of the class of generalized linear models. 

* Trevor Hastie was partially supported by grant DMS-1407548 from the National Science Foundation, 
and grant R01-EB001988-15 from the National Institutes of Health. 
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Our proposed estimator selects between fitting each fj as zero, linear, or nonlinear, as 
determined by the data. In so doing it retains the interpretability advantages of linear fits 
when appropriate, while capturing strong non-linear relationships when they are present. 
Our method, which we call GAMSEL (Generalized Additive Model Selection), is based on 
optimizing a penalized (negative) log-likelihood criterion of the form, 

p 

/i,..., /p = argmin £{y, fi,... Jp) + V J(/j) 

fu-Jp^T ^ 

for a particular choice of J{f). We give the form of the penalty term in Section 2.3, after 
presenting some motivating preliminaries. 

1.1. Related literature. Since the introduction of the lasso in Tibshirani (1996), a signif¬ 
icant body of literature has emerged establishing the empirical and theoretical success of ii 
penalized regression in high-dimensional settings. The method we introduce in this paper 
can be viewed as one possible extension of the lasso to the additive model setting. 

There have been numerous other attempts at this extension. The methods most closely 
related to our proposal are COSSO (Lin et ah, 2006), SpAM (Ravikumar et ah, 2007), and 
the method of Meier et ah (2009). Each of these proposals is based on penalized likelihood 
and the difference comes from the choice of J- and the structure of the penalty term J(/). 
COSSO models the components fj as belonging to a reproducing kernel Hilbert space 
(RKHS), and operates by penalizing the sum of the component RKHS-norms (instead of 
the sum of squared norms). SpAM is effectively a functional version of the group lasso (Yuan 
and Lin, 2006); it decouples the choice of smoother from the sparsity constraint, and is thus 
broadly applicable to any choice of smoothing operator. The penalty function proposed in 
Meier et ah (2009) is the quadratic mean of the component function norm and a second 
derivative smoothness penalty, summed over the components. The authors argue that the 
quadratic mean penalty structure enjoys both theoretical and computational advantages 
that closely related formulations do not. 

Our method is distinct from the existing proposals in that it selects between linear and 
non-linear fits for the component functions. Later in the paper we present experimental 
results showing that GAMSEL performs nearly as well as SpAM when the true component 
functions are all non-linear, and can perform considerably better when some of the true 
component functions are linear. 

During the writing of this manuscript we became aware of the SPLAM method being 
developed independently by Lou, Bien, Garuana, and Gehrke (Lou et ah, 2014). This method 
also selects between zero, linear and non-linear fits for component functions in a generalized 
additive model framework, but differs in the choice of penalty function. 

Model selection is of course a very old problem in statistics, and there are several pop¬ 
ular methods for variable and smoothness selection that were developed without the high¬ 
dimensional or sparse setting in mind. Of particular note is the AIG-based stepwise selection 
procedure implemented in the gam R package (Hastie, 2015), as well as the methods imple¬ 
mented in the widely used mgcv package (Wood, 2011). Later in the paper we compare the 
selection performance of GAMSEL and step.gam. The methods implemented in mgcv have 
a different focus in that they are geared more for smoothness selection rather than variable 
selection. 
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1.2. Layout of the paper. We begin in Section 2 by giving a brief overview of smooth¬ 
ing splines in order to motivate the form of the penalty term in the GAMSEL objective. 
The full form of the penalized likelihood objective is introduced in Section 2.3. Sections 3 
and 4 present technical details concerning the fitting procedure. Section 3 describes the 
basis and penalty matrix construction used by the procedure, and Section 4 presents the 
blockwise coordinate descent algorithm for fitting the GAMSEL model. In Section 5 we 
investigate the model selection performance of our method on a simple simulated example, 
and present the results of fitting GAMSEL to the Boston Housing and HP spam datasets. 
In Section 6 we compare the model selection and prediction performance of GAMSEL to 
that of two competing methods: SpAM and step.gam. Section 8 wraps the paper up with 
some conclusions. 


2. Method. In this section we describe the penalized optimization problem underlying 
GAMSEL. We begin with a brief review of smoothing splines to help motivate the form of 
the penalty term. This is followed by a brief overview of generalized additive models. For a 
more comprehensive treatment of both of these subjects we refer the reader to Green and 
Silverman (1993), Hastie and Tibshirani (1990) or Wood (2011). 


2.1. Overview of Smoothing Splines. Gonsider the univariate regression setting where 
we observe data on a response variable y and a single predictor x. We assume that x takes 
values in some compact interval on the real line. Given observations (yi,xi),..., {yn,Xn) 
and a choice of A > 0, the smoothing spline estimator, f\, is defined by 


( 2 . 1 ) 


n 

fx = argmin Y](yi - f{xi)f + A 


fitfdt 


While at hrst glance this may appear to be a difficult optimization problem due to the 
uncountably infinite parameter space C^, it can be shown that problem reduces to a much 
simpler form. The solution to (2.1) is a natural cubic spline with knots at the unique values 
of X. It can be shown that the space of such splines is n-dimensional, which allows us to 
express the smoothing spline objective as a hnite dimensional problem. 

Given a set of natural spline basis functions {hi,h 2 ,... ,hn}, we can re-express the ob¬ 
jective function f f as f{x) = hj{x)6j. Forming Hnxn = we note that 

solving (2.1) amounts to estimating 8 according to, 

(2.2) 9x = aigminWy — H9\\2 + 

0eK" 


where the penalty matrix Q has entries Lljk = f hj(x)h'^(x}dx. Note that fl depends on 
the hj, which in turn depend on the Xi, but not on y or 0. This is just a generalized ridge 
regression problem, so we know that the the n-vector of fitted values^ has the form, 

A = ffffx = Sxy. 


The matrix Sx appearing above is commonly referred to as the smoother matrix. To better 
understand the action of Sx, it helps to take a special choice of spline basis. 

'^We are overloading notation here by using fx to denote both the full smoothing spline function and the 
n X 1 vector of fitted values. 
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While the penalty matrix is in general not diagonal, it becomes diagonal if we take our 
basis to be the Demmler-Reinsch basis (Demmler and Reinsch, 1975), which we’ll denote 
by {ui{x), U 2 {x),..., Un{x)}. We’ll also denote the now-diagonal penalty matrix by D = 
diag(di, d 2 , ■ ■ ■, dn)- Assuming that the Uj are ordered in increasing order of complexity, D 
has the property that 0 = di = d 2 < d^ < ... < dn- di and d 2 correspond to the constant 
and linear basis functions, respectively. The non-zero dj are associated with Uj that are 
non-linear functions of x, with higher indexes corresponding to Uj with greater numbers of 
zero-crossings. Furthermore, the n x n matrix U with columns the Uj is orthonormal. ^ 

In the new choice of basis, the estimation problem (2.2) reduces to, 

(2.3) 6 \ = SLigv[mi\\y — U9\\\ +X9^ DO 

= {I + \D)-^U^y 

With further rewriting, we have that the smoothing spline solution f\ takes the form, 

(2.4) h = U{I + \D)-^U^y = j2y^,pf^ 

'• -V-^ 1 + Xdj 

Since di = d 2 = 0, we see from (2.4) that S\ has the effect of keeping the components along 
the first two basis functions intact (i.e., unpenalized) and shrinking the later components 
by a factor of Tj = < 1. The degrees of freedom for a smoothing spline is defined to 

be df^ = D — and lies between 2 and n. 

Using this insight we can rewrite (2.3) by separating out the constant term m and linear 
term U 2 to obtain the optimization problem 

(2.5) {ao,a,$)= argmin \\y - uq - ax - U 3 -,n/ 3 \\l + D 3 .,n(d, 

ooeR,aeR,/3eR"-2 

where U 3 :n is the matrix with columns U 3 ,... ,Un and D 3 -n = diag((i 3 , d^,..., dn)- So the 
smoothing spline fits a mix of terms linear in x and nonlinear in x, and the nonlinear terms 
are penalized by an amount that increases with their complexity. It is this formulation of the 
problem that most directly motivates the GAMSEL objective we introduce in Section 2.3 
Before moving on, it is instructive to look at the rate of decay of the shrinkage factors 
Tj, which are displayed along with the corresponding Xdj in Figure 1 for a smoothing spline 
with 5 degrees of freedom. As we can see, the shrinkage parameters tend to decay rapidly to 
0, which means that high order basis functions generally have very little effect on the model 
fit. This suggests that we would not lose much by replacing (2.3) with a truncated version 
where we only consider the first m basis functions for m n- For instance, in the case of 
5 degrees of freedom we may be comfortable proceeding with m = 10 basis functions. 

2.2. Overview of Generalized Additive Models- Although generalized additive models 
(Hastie and Tibshirani, 1990) are dehned for general smoothers and distribution families, 


^We can find a matrix A such U = HA is orthonormal, and AAQ-A is diagonal, and elements increasing. 
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Fig 1; Values of the first 15 shrinkage factors tj (left) and adjusted penalty values Xdj 
(right) for a smoothing spline with 5 degrees of freedom. 


we will limit our discussion to the use of smoothing splines and the Gaussian model. Consider 
the extension of criterion (2.1) when there are p predictors 


n p ^ r 

(2.6) minimiz^e ^(?/i - ^ fj{xij)f + A ^ 

l/jec J 

Using extensions of the arguments for smoothing splines, one can show that the solution 
is hnite dimensional, and each function fj can be represented in a natural spline basis in 
its variable Xj. Since each function includes an intercept, we typically isolate the intercept, 
and insist that each fitted function average zero over the training data (without any loss of 
generality). With some simplification of the notation in (2.5), this leads to the optimization 
problem 


(2.7) 


minimize 

oo6M, 


p 

-ao-Y^ Ujfdj 


\l + XyjjD,(3,. 


i=i 


The matrix Uj represents an orthonormal Demmler-Reinsch basis for variable Xj] each of 
its columns are mean centered (orthogonal to the column of ones), and the first column is a 
unit-norm version of Xj. Dj is a diagonal penalty matrix as in the previous section, except 
only the first element is zero (since the intercept term is not represented here). 

In principle this problem is a large generalized ridge regression, although naive com¬ 
putation would lead to an 0{{np)^) algorithm. Since a single smoothing spline can be 
implemented in 0(n) computations, Hastie and Tibshirani (1990) proposed the backfit- 
ting algorithm for fitting the model. This is a block coordinate descent approach, with p 
blocks, each costing 0{n) computations. Hence the entire algorithm is 0{knp), where k 
is the number of backfitting loops required for convergence. Wood (2011) instead reduces 
the dimensionality of each basis Uj as eluded to in the previous section, and hence reduces 
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the computations of the generalized ridge regression. Marx and Eilers (1998) also use a 
truncated bases of P-splines to represent penalized splines and generalized additive models. 

Both the gam package (Hastie, 2015) and the mgcv package Wood (2015) in R can fit 
GAMs with smoothing splines. The step.gam function in gam allows the use to perform 
model selection via a step-wise procedure. For each variable, the user provides an ordered 
list of possibilities, such as zero, linear, smooth with 5df, smooth with 8df. Starting from say 
the null model (where all terms are zero), and in forward mode, the procedure tries a move 
up the list for each term, and picks the best move (in terms of AIC). While quite effective, 
this tends to be laborious and does not scale well. The function gam. selection in package 
mgcv is designed to select separately the correct amount of smoothing for each term, rather 
than feature selection. 


2.3. GAMSEL Objective Function. With this motivation in mind, we now present our 
criterion for fitting a GAM with built-in selection. For ease of presentation we focus here on 
the case of squared-error loss. The extension to the logistic regression setting is presented 
later in Section 4.2. 

We have data (y*, Xj) for i = 1,..., n. We represent the mean function for the jth variable 
as a low-complexity curve of the form 

fj{x) = ajXj Uj{xjY 13 

where Uj is a vector of mj basis functions. Let Uj G be the matrix of evaluations 

of this function at the n values {xij}f, and assume without loss of generality that Uj has 
orthonormal columns. 

GAMSEL estimates the fj by solving the convex optimization problem 


( 2 . 8 ) 


y-ao-Y^ ajXj - Y 


i=i 




p 

+ XY(3\aj\ + {l-'y)\\/3j\\D*'^ + 


j=i 


selection penalty 


i ^2’PipDA 

i=i 

-V-^ 

end-of-path penalty 


where ||/?j||_D* = have taken some liberties with the notation in (2.8); both 

y and Xj are now n vectors. 

Let us first focus on the end-of-path penalty, which is all that is enforced when A = 0. The 
criterion is now equivalent to (2.6) (even though the linear terms are represented twice!). 
The multiplier ijj for each term is chosen so that the fit for that term alone would result in 
a pre-specified degrees of freedom. Hence when A = 0, we fit a generalized additive model 
with pre-specified degrees of freedom for each term. 

The selection penalty is more complex, and consists of a mixture of 1-norm and 2-norm 
penalties for each term. These take the form of an overlap group-lasso penalty (Jacob, 
Obozinski and Vert, 2009), which has the effect of inducing sparsity in the fitted model. 
The term ||/3j||z)* is a group-lasso penalty Yuan and Lin (2006); it behaves like the lasso 
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but for a vector of coefficients. This penalty either includes all the parameters, or sets them 
all to zero. The overlap refers to the fact that each Xj has a a pair of linear coefficients, one 
represented in WoWd*-, the other in \aj\. Here the matrix Dt is identical to Dj, except the 
0 in position one is replaced by a 1 (so the linear term in Uj is penalized); we discuss these 
in more detail later in this section. The parameter 7 is between 0 and 1, and allows one 
to favor linear terms (small 7 ) over non-linear terms, or vice versa. Due to the particular 
structure of this penalty, there are three possibilities for each predictor. 

Zero {aj = 0,13j = 0). For large values of the penalty parameter A, the penalty term can 
dominate the lack-of-fit term, which results in the minimizer having both aj = 0 and 
Pj = 0. This corresponds to the case fj{x) = 0. 

Linear (aj 7 ^ 0,f3j = 0). For moderate values of the parameter A and sufficiently small 
7 > 0, the minimizer can have aj 7 ^ 0 and /3j = 0. This corresponds to the case where 
fj{x) = ajX is estimated to be a strictly linear function of x. 

Non-linear {(3j 7 ^ 0). For small values of A and/or large values of 7 , the minimizer can 
have 13j 7 ^ 0. This corresponds to fitting a low-complexity curve of the form fj{x) = 
ajX -\- Uj(3j for the jth. predictor. Note that depending on the choice of 7 , Oj may or 
may not be 0, but because of the overlap (the first column of Uj is a linear term), this 
implies a linear term is present. 

We refer to these choices as sticky points since certain thresholds have to be met to cross 
from one state to the next. Typically, as we relax A, there is a transition from zero through 
linear to nonlinear, settling finally on the end-of-path nonlinear term with pre-specified 
degrees of freedom. Figure 2 illustrates on a synthetic example. 

Now we explain the D*- in more detail. The Dj and hence D* are customized versions of 
the diagonal D matrix from the previous section, separately for each variable Xj. The first 
element of Dj is zero, and all subsequent elements are positive and increasing (see the right 
plot in Figure 1, and recall there is an extra zero there for the intercept). We have scaled 
these elements of Dj so that the second element is 1 (the penalty for the first nonlinear 
term). Hence in Dj, when we replace the zero in the first entry by a 1, this means that 
as far as this penalty is concerned, a linear term is treated the same as the first nonlinear 
component. 

With these details in mind, we can now understand the selection penalty better. If a 
function fj is really linear, it can enter via selection through Oj or the first coefficient of /3j; 
with 7 = 0.5, selection through f3j picks up unwanted penalty on the other components of 
(3j which will model the noise, and so the linear term is more likely to enter via aj. With 
7 < 0.5 we can encourage this even more strongly. 

In principal, once can fine-tune these penalties even further. For example, with the group 
lasso and our parametrization, one can argue that each penalty ||/?j||_D* should be multiplied 

by the scalar cpj = yjtv{D*~^), to put each of these terms on an equal footing. Our stan¬ 
dardization does something similar, while allowing for easier understanding of the treatment 
of the linear term. 

We provide an R implementation for fitting this model along a dense path of values of 
A — essentially the entire regularization path for (2.8). We describe the package gamsel in 
Section 7. 
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First 30 steps of GAMSEL 



Fig 2: Synthetic example with 12 variables, one panel per variable, to illustrate the sticky 
nature of the GAMSEL procedure. Gaussian noise was added to an additive model in these 
twelve uniformly-distributed variables. The true functions are shown in black. In the first 
column, the three true functions are linear, in the second nonlinear, and in the remaining two 
columns all six terms are zero. We show the progression (via overplotting) of the GAMSEL 
procedure in its first 30 steps (out of 50), corresponding to decreasing values of A. The 
estimated functions are in color: blue means zero, green means linear, and red nonlinear. 
We see that 7 functions are exactly zero, two linear terms are well approximated, and the 
nonlinear terms are reasonably well approximated, some better than others. 
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3. Generating the spline bases. In this section we discuss how we produce the basis 
matrix U and associated penalty matrix D for any variable x. The idea is to approximate 
the truncated Demmler-Reinsch basis and penalty for a smoothing spline. While this can 
be achieved in a variety of ways, we adopt the pseudo-spline approach of Hastie (1995), 
which is particularly simple. It hinges on the fact that the Demmler-Reinsch basis shares 
the zero-crossing behaviour of orthogonal polynomials. Let S\ denote the nxn smoothing- 
spline operator matrix for smoothing against x, and assume that A has been chosen to 
achieve a fixed df = trace(5A). Our goal is to produce a low-rank approximation to Sx] 
that is, we want to find a nx k basis matrix U, and a diagonal penalty matrix D such that 
Sx^U{I -\- XD)~^U'^. A typical value for k is 10. 

Let P be the n x k matrix of orthogonal polynomials generated from x. The version of 
pseudo-splines we use here solves 

(3.1) mm\\Sx- PMP^Wf. 

M 

The solution is easily seen to be M = P"^SxP. We then compute the eigen-decomposition 
M = VDsV'^, and let U = PV and D = - L 

The approximation is easy to compute, since SxP simply executes the smoothing spline 
k — 2 times (the first two columns of P are constant and linear, and so pass through Sx 
untouched. The smoothing spline-algorithm we employ can apply the smoother to a matrix, 
so the setup is not repeated for each column of P. See the reference for more detail. The 
first two penalty elements are Du = D 22 = 0; the remaining elements are positive and 
increasing. In practice, we discard the first element (corresponding to the intercept), and 
this leads to the bases Dj and Uj referred to in (2.6) and (2.8). 

The basis U found by solving (3.1) lies in the space of polynomials; Hastie (1995) describes 
a simple improvement that requires a bit more computation. In the first step we replace P 
by Q where QR = SxP (QR decomposition). Now Q is an orthogonal basis in the space of 
natural splines in x. Solving (3.1) with Q rather than P leads to a better approximation 
to Sx (in Frobenius norm). In practice, the improvements are very small, and generally not 
worth the effort. Our package gams el allows either of these options. 

3.1. Prediction at new points. We have generated these bases (and penalties) for the n 
training observations. How do we use them to make predictions at new points? It suffices 
to consider the case of a single variable x. 

Armed with a basis U and associated diagonal penalty matrix D, we fit a smooth term 
to a response vector y by solving 

(3.2) mm\\y-U/3\f2 +X/3^DP, 

or a similar quantity in the case of logistic regression. To make predictions at a vector of 
new points xq, we need to generate the corresponding matrix Uq, and then the predictions 
are /o = Uo(d. Well, Uq = PqV, and Pq is the matrix of orthogonal polynomials evaluated 
at xq. If we used Q rather than P to generate the basis, we have to use the appropriate 
matrix Qo obtained by making predictions from the smoothing spline fits that produced Q. 
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3.2. Subsetting a basis. We compute all the bases for each variable Xj in advance, and 
these are used throughout the GAMSEL computations. Our software assumes that the Uj 
for each variable are orthonormal matrices. When we run cross-validation to select the 
tuning parameter(s), it might seem that we have to regenerate the bases each time, since 
the subsetted Uj will no longer be orthonormal. It turns out that this is not necessary, nor 
desirable. We would like to work with the same basis as before, and simply modify the 
problem accordingly. Let Ui be the non-orthonormal matrix obtained by removing a subset 
T of of the rows of U, and let yi be the corresponding subset of y. The full-data problem 
solves (3.2), which can be written as 

(3.3) min^Wy-UP\f2 +X^p'^D/3, 

where A* = \/N. It makes sense for the reduced-data estimate to solve 

(3.4) mm^\\yi-UiP\\l + X^P^D/3. 

j3 

Now this is not in the required orthogonal form, but can easily be transformed. It is easy 
to show that the following steps produce a and a D* that produce exactly the solution 
to (3.4), but with in the required orthogonal form. 

1. Compute the SVD UiD-^ = U^D 2 V^. 

2. D* = D2^. 

Furthermore, since Uf is a linear transformation of Ui, which is itself a linear transforma¬ 
tion of Pi (corresponding subsetted version of P), the information for predicting from this 
reduced basis can be saved. Note that if the problems are solved in the unstandardized form 
(3.2), then the penalty A should be transformed to ^A for the reduced problem. Note also 
that while this works fine for a problem with quadratic penalty, our criterion (2.8) treats 
the linear term in a special way. However, since the linear penalty in D is zero, this term is 
isolated and is not rotated in the construction of U^. 

4. Blockwise Coordinate Descent Algorithm. In this section we describe a block- 
wise coordinate descent algorithm for fitting the GAMSEL model in (2.8). Our routine re¬ 
turns parameter estimates along a sequence of lambda values, Amax = Ai > A 2 > ... > Xl. 
We take advantage of warm starts by initializing the estimates for A^+i at the solution 
for Afc. We also use sequential strong rules (Tibshirani et ah, 2012) to provide considerable 
speedups by screening out variables at the beginning of each A iteration. The form of the 
sequential strong rules is described at the end of the section. 

4.1. Blockwise descent (squared-error loss). This section describes the update steps for 
the blockwise coordinate descent algorithm. Optimizing over aj for a fixed j given the 
current values of the other parameters amounts to a simple one-dimensional lasso problem 
whose solution is given in closed form by soft thresholding. The optimization step for (dj 
is rather more involved, and we find that we are unable to obtain an entirely closed form 
solution to the corresponding subgradient equation. We instead give a closed form solution 
up to a quantity that is calculated via a simple one-dimensional line search. 
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In deriving the update steps we assume that the predictors have been centred and nor¬ 
malized. 

Intercept term. Since the predictors are centred and the intercept term is left unpenalized 
in the objective, the intercept estimate is given by 


1 

ao = — 
n 




(7) 

Optimization in a. Given current estimates (Xk for k ^ j and Pk for ah k, let ra' denote 
the residual vector 


^ = y- I «0 + ^ akXk + ^ Uk^k 

k^j k=l 

With this notation, the optimization problem (2.8) reduces to a simple one-dimensional 
lasso problem, 

oij = argmin - \\r^J'^ — ajXj \\2 -t-yAlojl 
“i 2 

The solution to this is known to be given by, 

(4.1) dj = 5(xJrW;7A) 

where S{y;t) is the soft-thresholding function defined as S{x] A) = sgn(x)(|x| — A)+. 

Optimization in (3. Given current estimates jSk for k ^ j and dfc for all k, let denote 
the residual vector 


U) 

rp = y 


E Uk^k 

k=l kj^j 


As a function of /3j, the optimization problem (2.8) reduces to 


= arg min - 


rf - UjPj 


+ A(1 - l)\\Pj\\D* + 


In order to simplify notation we temporarily drop the explicit dependence on j and make 
the substitutions A = A(1 — 7 ), 0 = = (0,02, • • • ,&mj) and V = so 

that the above equation becomes 

9 = argmin ^ ||r - V9f + A||0||2 + 7V'0f_i)0(-i) 

Differentiating the objective with respect to 9 gives the subgradient equation, 

(4.2) - V^{r - V9) + As + = 0 

where s is in the subgradient of the £2 norm: 


G {u G 


\u\\2 < 1 } 


if 0 = 0 


s = 


otherwise 
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From the subgradient equation (4.2), we see that 0 = 0 is a solution if ||l^^r ||2 < A. 
Otherwise, the subgradient equation becomes 

_v^^r-ve) + ^e^_,) + x^^=o 

Some algebraic manipulation along with the substitution D = V'^V + = D*~^ + 

produces the simpler expression, 

A3) + 

If the matrix D were of the form D = cl, then this equation would have a simple closed 
form solution for 6. This condition is typically not satisfied, so we proceed by first solving 
for c = ||0||2 and then substituting this quantity back into (4.3) to solve for 6. 

Taking the squared norm of both sides of (4.3) and doing some algebra gives 



Solving for c amounts to carrying out a simple one-dimensional line search. 

Let c denote the solution to (4.4). Reversing all the substitutions we made along the way, 
we conclude that the update step for looks like 

fo, if ||T>*"^/^C/Jr^^'^|| 2 <A(l- 7 ) 

^ ^ I D*~^ujr^p, otherwise 

4.2. Logistic regression. In this section we describe the blockwise gradient descent algo¬ 
rithm that we use for logistic regression. The log-likelihood function for logistic regression 
has the form. 


n / p 

6 ) i{ao,a,l3) = '^yi\ao + '^fj{xi) 

i=i \ j=i 

A common approach is to use a Taylor expansion about the current parameter estimates 
to obtain a quadratic approximation to the log-likelihood. This results in a weighted residual 
sum of squares expression of the form, 

/,(x.)) +C 


I n / p 

iQ{ao, a,/3) = - '^Wi - oq - ^ 
i=i \ j=i 



(4.7) 
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where 



(4.8) 

Zi - ao + '^fjixi) + 

^ Wi 

7 — 1 

(working response) 

(4.9) 

Wi =p{Xi){l -p{xi)) 

(weights) 

(4.10) 

p{xi) = h 



See Hastie and Tibshirani (1990) for details. 

The full penalized weighted least squares objective then has the form 


(4.11) 


1 

2 


14/1/2 


z — ao 


i=i 




/=! / 2 

+ -^X](7|ail + (1 -7)ll/3il|D*) + 

J=i j=i 


It turns out that, for general values of the weights Wi, the update step for the /3j terms 
amounts to solving an equation having the same form as (4.4), with the added difficulty that 
now D is non-diagonal. While the resulting equation is possible to solve (say, by evaluating 
the eigen-decomposition of D), doing so incurs a greater computational cost than incurred 
in the linear regression setting. 

Our approach is to instead use a coarser majorization, replacing all of the weights with 
Wi = 0.25, the maximum possible. This reduces (4.11) to the same objective as in the linear 
regression case, with response given by the working response vector, z. In summary, we 
proceed as follows. 

Outer loop: Decrement A. 

Middle loop: Update the quadratic approximation £q using the current values of the pa¬ 
rameters, {ao: Oj, setting Wi = 0.25. 

Inner loop: Run the coordinate descent procedure of §4.1 on the penalized least squares 
problem (4.11) until convergence. 


4.3. Strong rules. Strong rules as introduced in Tibshirani et al. (2012) are an effective 
computational hedge for discarding inactive predictors along the A path. Given the solution 
at Afc_i, strong rules give a simple check for screening out predictors that are likely to be 
inactive in the solution at Afc < Xk-i- The idea is to ignore these discarded variables, fit the 
model, and then confirm their omission. If any are omitted in error, they are then included 
in the active set, and the fitting is repeated. In practice, this refitting is very rarely needed. 

Letting y(A) denote the fitted values at penalty parameter value A (probabilities in the 
case of logistic regression), the sequential strong rules for GAMSEL can be shown to be 

Strong rule for aj: Discard aj if 


xj{y - y{Xk-i))\ < 7(2Afc - Afc_i) 
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Strong rule for f3j: Discard f3j if 

\\uj {y - y{\k-i)) + < (1 - 7)(2Afc - Afc_i) 

In the implementation of GAMSEL we adopt the approach recommended by the authors 
in section 7 of Tibshirani et al. (2012), which takes advantage of both active set methods 
and strong rules screening. 

These strong-rule formulas also expose to us the the value of Xmax, the largest value of 
A we need to consider in the GAMSEL regularization path: 


(4.12) 


xjyl \\Ujyh 

max —^- 

j 7 ' 3 1-7 


Xmax = max max ' ^ ^ 


Eor any values of A > Xmax, ah terms in the model bar the intercept are zero. 


5. Experiments. We now demonstrate the GAMSEL procedure on synthetic and real 
data. 


5.1. Simulation study. We begin this section with a simple simulation where we have 
n = 200 observations on p = 30. Variables 1-6 are linear; 7-10 are polynomials of degree 5, 
and the remaining 20 variables (indexes 11-30) are zero. All coefficients in the simulation 
are generated randomly. We begin by running GAMSEL with 7 = 0.4 and use 10 basis 
functions with 5 degrees of freedom for each of the 30 variables. 

Figure 3 shows plots of the fitted coefficients of 6 variables for several values of the penalty 
parameter A. At A 25 all six variables shown are correctly classified as linear, nonlinear 
or zero. By A40, variable 3 gets a non-linear ht, as does the noise variable at index 16. 
The estimated functions for variables 9 and 10 become increasingly better matches for the 
underlying signals as A decreases. 

We repeat the simulation 500 times at three values of 7 : 0.4, 0.5, 0.6. Figure 4 shows 
plots of misclassification error across the full sequence of A values for the three choices of 
7 . Four types of misclassification are considered: 

Zeros. The fraction of truly zero variables that are estimated as non-zero (i.e., as linear or 
non-linear). 

Linear. The fraction of truly linear variables that are estimated as zero or non-linear. 
Non-linear. The fraction of truly non-linear variables that are estimated as zero or linear. 
Zero vs. Non-zero. The misclassification rate between the categories zero and non-zero. 

We see from the Figure 4 that the Zero vs. Non-zero misclassification rate as well as 
the Zeros misclassification rate are fairly insensitive to the choice of 7 . By design, Linear 
and Non-linear misclassihcations are very sensitive to the choice of 7 . Since non-linear fits 
can be used to represent linear functions but not vice-versa, taking 7 much larger than 
0.5 will result in all nonzero fits being non-linear. This behaviour is observed in the panel 
corresponding to 7 = 0.6 in the present simulation. 

5.2. Boston Housing. The Boston Housing Dataset contains housing information on me¬ 
dian house values for n = 506 census tracts in the Boston Standard Metropolitan Statistical 
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Fig 3: Estimated fj for 6 variables across four A values for a single realization of the sim¬ 
ulation described in 5.1. Linear fits are displayed in green; nonlinear fits are displayed in 
blue. At A 25 all 6 variables are correctly identified as linear, non-linear and zero. 


Area. Of the 13 covariates present in the dataset, p = 10 are sufficiently continuous to be 
meaningfully analysed within the GAMSEL framework. These variables are: CRIME, IN¬ 
DUS, NOX, RM, AGE, TAX, PTRATIO, B, and LSTAT. In addition to considering the 
original variables in the dataset, we follow the approach of Ravikumar et al. (2007) and 
generate 20 noise variables to include in the analysis. The first ten are drawn uniformly 
from the interval [ 0 , 1 ], while the remaining ten are obtained as random permutations of 
the original ten covariates. We fit GAMSEL with 7 = 0.5 using 10 basis functions with 5 
degrees of freedom for all 30 covariates. Eigure 5 summarizes the regularization plots, which 
display aj and ||/ 3 j ||2 across the sequence of A values. This plot shows that there are 5 strong 
predictors that enter well before the ‘bulk’: LSTAT (nonlinear), RM (nonlinear), PTRATIO 
(linear), CRIME (linear), and B (linear). TAX and NOX also enter the model before any of 
the noise variables, but they are somewhat borderline. Eigure 6 shows estimated functions 
for LSTAT, RM, PTRATIO and CRIME at A 29 = 24.1, right before the noise variables 
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X 


X 


(a) 7 = 0.4 


(b) 7 = 0.5 


(c) 7 = 0.6 


Fig 4: This figure illustrates the classification performance of GAMSEL on the four misclas- 
sification measures described in section 5.1. We observe that the overall Zero vs Non-zero 
misclassihcation curve is not sensitive to the choice of tuning parameter 7 . As we vary 7 , 
the main difference is in how the terms are classified between linear and non-linear. Since 
non-linear terms can be used to represent linear ones, it is in general inadvisable to take 
7 > 0.5. 


Linear Components Non-linear Components 




X 


X 


Fig 5: Regularization plots for the Boston Housing data. Solid lines correspond to the 
original 10 variables; dashed lines are from the simulated noise variables. 


enter. The fitted curves do a good job of capturing the trends in the data. 

5.3. Spam Data. Next we apply GAMSEL to a spam classihcation problem using the 
Hewlett-Packard spam dataset. This dataset consists of textual information from n = 4601 
email messages, each of which is labelled spam or email. There are 57 predictors, of which 
48 give the percentage of words matching a given word (e.g., ‘free’, ‘hp’, ‘george’); 6 give 
the percentage of characters that match a given character (e.g., $, !); and 3 measure unin¬ 
terrupted sequences of capital letters. We conduct two analyses of the spam data. First, we 
analyse the data using the test set and training set partitioning described in section 9.1.2 of 
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Fig 6: Boston Housing: Estimated functions at A 28 = 24.1. 
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-logW -log(X) 

Fig 7: Minimum test error is 7.0% for the plot shown. Ise A from CV gave a test error of 
5.5% 


Hastie et al. (2009). Using the 3065 observations in the training set, we fit GAMSEL with 
7 = 0.5, taking 10 basis functions with 4 degrees of freedom for all 57 covariates. The tuning 
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parameter A is chosen via cross-validation using the 1 standard error rule. The left panel of 
Figure 7 shows the full cross-validation curve along with the test set misclassification error 
curve across the entire sequence of A values. We see that the one-standard-error rule selects 
the model corresponding to A 39 = 0.362. The selected model has cross-validated misclas¬ 
sification error of 4.8%, and test misclassification error of 5.5%. This compares favourably 
with test set error rate of 5.3% reported in Hastie et al. (2009), as found by step.gam. 

In our second analysis we further sub-sample the data to obtain a training sample of 
n = 300 messages, with the remaining 4301 messages used for testing. Due to the extreme 
skewness in some of the covariates, the training sample contained several variables for which 
as few as 3 unique values were observed. For the purpose of this analysis we used (degree, df) 
G {(10,4), (4, 2), (1,1)}, as determined by the number of unique values observed in the test 
sample for the given variable. The right plot of Figure 7 shows the test set misclassification 
error plotted against the penalty parameter A for a single realization of the training set. 
The minimum achieved test error is 7.0%. 

6 . Comparisons to other methods. In this section we compare GAMSEL with three 
other approaches to carrying out selection and estimation for generalized additive models: 

1. the SpAM method of Ravikumar et al. (2007), 

2 . the step.gam routine in the gam package, and 

3. the gam.selection procedure in the mgcv package. 

The SpAM method does not separate linear from nonlinear terms, but instead identifies 
terms to be set to zero. It is a modified form of backfitting, where each fitted function is 
thresholded via a group-lasso penalty each time it is updated. See Hastie, Tibshirani and 
Wainwright (2015, Chapter 4) for a succinct description. 

6.1. GAMSEL vs. SpAM. We begin by comparing the model selection performance of 
GAMSEL and SpAM in the simulation setting described in 5.1. As before, we have n = 200 
observations on p = 30 variables, of which 10 are non-zero. We considered three simulation 
sub-settings: #(linear, nonlinear) = (a) (10,0) (b)( 6 ,4), (c)(0,10). We ran GAMSEL with 
10 basis functions and 5 degrees of freedom and 7 = 0.5, and using the SAM package we 
ran SpAM with the default k = 3 basis spline functions. In our experimentation we also 
explored taking k > 3, but we observed that this resulted in a deterioration in the model 
selection performance of SpAM. We compared GAMSEL and SpAM based on the false 
discovery rate, defined as the fraction of variables estimated as non-zero that are truly 0 
in the underlying model. To put the methods on equal footing, we compared their EDRs 
at equal model sizes. Figure 8 shows a summary of the results. Six curves are shown in 
each subplot, one for each method across three different values of the noise variance. In 
this comparison we find that GAMSEL performs considerably better in cases (a) and (b), 
where some of the nonzero variables are linear. The only case where SpAM is observed to 
outperform GAMSEL is in the low noise regime in case (c), where all nonzero effects are 
nonlinear. Though we do not present the results here, we note that this trend was observed 
to persist at other values of noise variance and problem size (i.e., other choices of 
linear, and # nonlinear). 

Next we revisit the spam dataset and use it to compare the prediction performance of 
GAMSEL and SpAM. For this comparison we use the simulation setup described in the 
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(a) #(lin, nonlin) = (10,0) 


(b) #(lin, nonlin) = (6,4) 


(c) #(lin, nonlin) = (0,10) 


Fig 8 : SpAM vs. GAMSEL (7 = 0.5) on simulated data example with n = 200 observations 
and p = 30 variables, of which 10 are nonzero. Three scenarios are considered: (a) all 10 
nonzero variables are linear; (b) 6 variables are linear, and 4 are nonlinear; (c) all 10 nonzero 
variables are nonlinear. In cases (a) and (b) GAMSEL has a lower EDR than SpAM at all 
noise levels considered. In case (c), where the ground truth is that all nonzero terms are 
nonlinear, SpAM slightly outperforms GAMSEL in the low noise regime. 


second part of section 5.3. Starting with the full set of 4601 messages, we randomly sample 
300 of them for use in training, and set the remaining 4301 aside for testing. The sampling 
is repeated 50 times. 

To put the methods on equal footing, we once again make the comparison in terms 
of the size of the selected model along the path, rather than the underlying value of the 
regularization parameter. Since several penalty parameters can correspond to the same 
selected model, we record for each simulation instance the minimum test set misclassification 
error at each size of selected model. 

Eigure 9 shows boxplots of the minimum test error at each (even) model size across the 50 
simulation iterations. The main difference we observe comes at the extreme selected model 
sizes. Eor small model sizes, SpAM tends to have a lower test error, while for large model 
sizes GAMSEL has lower test error. The methods perform comparably in the middle range 
of model sizes. A closer inspection would reveal that GAMSEL tends to achieve a lower 
minimum test error overall. 

6.2. GAMSEL vs step. gam. The step. gam function in the gam R package can be used to 
select between zero, linear and non-linear terms in generalized additive models via stepwise 
model selection. Eor instance, instead of running GAMSEL with 5 degrees of freedom, one 
could try running step.gam initialized at the null model with scope ~1 + x + s(x,5) for 
each variable. Since GAMSEL produces a fit with maximal degrees of freedom only at the 
end of the path when A = 0, the analogy is not perfect. Of course one could always add in 
formula terms of the form s(x,d) for 1 < d < 5, but doing so would increase computation 
time without greatly affecting model selection performance. 

The main disadvantage of step. gam is the poor scalability of the procedure. While GAM¬ 
SEL scales to problems with thousands of observations and variables, step.gam in general 
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Method GAMSEL SpAM 



Fig 9; Test set misclassification error rates for GAMSEL and SpAM. For display purposes, 
results are shown at only even values of selected model size. 


does not. The search space quickly becomes too large for stepwise model selection to be 
computationally feasible. In what follows we compare GAMSEL and step.gam both in terms 
of model selection performance and computation time. 

We begin by comparing the model selection performance of GAMSEL and step.gam 
in the simulation setting described in 5.1. We run step.gam initialized at the null model 
with scope ~1 + x + s(x,5) for each variable, and compare it to GAMSEL with 10 basis 
functions and 5 degrees of freedom, taking 7 G {0.4, 0.5}. The methods are compared on 
the basis of Zero vs. Nonzero, Linear and Nonlinear misclassification rates, as dehned in 
5.1. To put both methods on equal footing the misclassification rates are compared at equal 
values of model size. 

Eigure 10 shows plots of misclassification rates for the two procedures. GAMSEL and 
step.gam perform comparably in terms of Zero vs. Nonzero misclassification rates. In the 
case 7 = 0.5, the two methods also have fairly similar Linear and Nonlinear misclassification 
rates across the range of model sizes. With 7 = 0.4, GAMSEL has considerably lower Linear 
misclassihcation rates, but higher Nonlinear misclassification rates. This trade-off between 
Linear/Nonlinear classification and interpret ability may be a desirable one, provided that 
the cost in model fit is not too great. 

Next we report the results of a pair of timing experiments conducted to compare the 
computational performance of GAMSEL and step. gam. The first timing experiment reports 
the timings for the simulation described above. For GAMSEL, we record the combined 
computation time expended on forming the bases {Uj} and obtaining model fits for the full 
sequence of A values. For step.gam we record the computation time expended on doing 30 
steps of forward stepwise model selection. 

The left panel of Table 1 shows a summary of the results. Even on this small example 
step.gam takes a fair bit of time to run. In this comparison GAMSEL is over 30 times 
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# of non-zero variables # of non-zero variables 

(a) 7 = 0.4 (b) 7 = 0.5 


Fig 10; Classification performance comparison between step.gam and GAMSEL. Lower 
values are better. Both methods have similar Zero vs. Nonzero misclassification rates, with 
GAMSEL performing slightly better when the selected model size exceeds 10. With 7 = 0.5, 
the two methods also have very similar Linear and Nonlinear misclassification rates. 


faster. 


n = 600, p = 90 

GAMSEL 

step.gam 

9.1 ±0.7 

606 ±32 


n — 200, p = 30 

GAMSEL 

step.gam 

1.24 ±0.1 

41.7 ±0.6 


Table 1 

Average computation time ± standard deviation (in seconds) for GAMSEL and step.gam 


Eor our second experiment we increase the problem size to n = 600 observations and 
p = 90 variables, of which 20 are nonzero (12 linear, 8 nonlinear). We record computation 
times as before, still only running step.gam for just 30 steps of forward stepwise selection. 
Results are reported in the right panel of Table 1. It is clear that step.gam does not scale 
well to larger problems. In particular, while timing results shown are times for just the first 
30 steps, with p = 90 one would typically wish to consider a larger number of steps, which 
would further increase the computation time. 

We also note that roughly 67% of the computation time of GAMSEL is expended on form¬ 
ing the bases {Uj}. Once the {Uj} are constructed, the average time expended computing 
the full solution path is 3.1 seconds. 

6.3. GAMSEL vs gam.selection {mgcv}. Our final comparison is to the automatic 
term selection approach implemented in the mgcv package in R (Wood, 2015). While the 
mgcv package is most often used for automated smoothness selection rather than variable 
selection proper, the package also supports variable selection using an approach that aug¬ 
ments the objective function with an additional penalty term (Marra and Wood, 2011). 
Unlike GAMSEL, this approach returns a single selected model rather than a sequence of 
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models of increasing complexity. To put the methods on equal footing, we compare the 
model selected using gam. selection to the GAMSEL model selected via cross-validation 
using the 1-standard error rule. We compare to each of the hve method options implemented 
in the mgcv package. 

The simulation setup for our comparison is the same as in 5.1. We run the gam fitting func¬ 
tion with the formula y ~ 1 + s(Xl, k=5, bs=‘crG + ... + s(X30, k=5, bs=‘cr’) 
and flag select = TRUE. The choice of k = 5 is made to ensure that n ■ k < p, which is a 
requirement of the mgcv fit. In interpreting the results, we deem that term in the mgcv fit is 
zero if the edf is < 0.1, linear if the edf is in the range [0.1,1.3), and non-linear otherwise. 
Our results are robust to the choice of cutoffs; just about any reasonable choice produces 
the same results. Our findings are summarized in Table 2. GAMSEL does a far better job 
in screening zero or linear terms than gam. selection. 

Table 2 

Comparison of GAMSEL to mgcv automated term selection approaches using the simulation setup described 
in 5.1. Quantities are averages over 100 iterations of the simulation. GAMSEL tends to select smaller 
models than any of the mgcv methods, while having comparable recall and much higher precision on overall 
variable selection. With the choice of y — 0.4, GAMSEL applies a higher penalty to non-linear terms than 
to linear ones, which results in higher recall of linear terms eompared to non-linaer terms. GAMSEL 
considerably outperforms mgcv on all measures except non-linear term recall. 


method 

GAMSEL 

GCV.Cp 

ML 

P-ML 

P-REML 

REML 

ff nonzero terms 

17 

25 

24 

23 

23 

24 

Zero-vs-nonzero misclassification rate 

0.25 

0.51 

0.47 

0.44 

0.44 

0.47 

Precision (nonzero terms) 

0.61 

0.40 

0.42 

0.44 

0.44 

0.42 

Recall (nonzero terms) 

0.97 

0.99 

0.99 

0.99 

0.99 

0.99 

Precision (linear terms) 

0.43 

0.32 

0.34 

0.37 

0.37 

0.34 

Recall (linear terms) 

0.86 

0.52 

0.74 

0.77 

0.76 

0.74 

Precision (nonlinear terms) 

0.69 

0.24 

0.32 

0.33 

0.33 

0.32 

Recall (nonlinear terms) 

0.61 

0.90 

0.84 

0.82 

0.82 

0.83 


7. Gamsel package in R. We have produced an R package gamsel that implements 
GAMSEL for squared-error loss and binomial log likelihood (logistic regression). The bulk 
of the fitting is done in C, which results in fast execution times. Similar in structure to the 
package glmnet, the gamsel function hts the entire regularization path for the GAMSEL 
objective (2.8) over a grid of values of A (evenly spaced on the log scale; see (4.12)). There 
are predict, plot and summary methods for fitted GAMSEL objects, and cv.gamsel per¬ 
forms cross-validation for selecting the tuning parameter A. The package is available from 
http://cran.r-proj ect.org/. 

8. Conclusion. In this paper we introduced GAMSEL, a penalized likelihood pro¬ 
cedure for fitting sparse generalized additive models that scales to high-dimensional data. 
Unlike competing methods, GAMSEL adaptively selects between zero, linear and non-linear 
fits for the component functions. Our estimator therefore retains the interpretational ad¬ 
vantages of linear fits when they provide a good description of the data, while still capturing 
any strong non-linearities that may be present. 

We investigate the model selection and prediction performance of our method through 
both simulated and real data examples. In comparisons to SpAM, a competing method, our 
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experiments indicate that GAMSEL can have notably better selection performance when 
some of the underlying effects are in truth linear. Moreover, GAMSEL was observed to 
perform competitively even when the ground truth is that all effects are non-linear. We also 
showed that our method performs comparably to forward stepwise selection as implemented 
in the gam package, while also scaling well to larger problems. 

The block-wise coordinate descent algorithm for fitting the GAMSEL model in the case 
of linear and logistic regression, along with the plotting and cross-validation routines are 
implemented in the gamsel package in R. 
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